Finding top-k elements in data streams
نویسندگان
چکیده
Identifying the most frequent elements in a data stream is a well known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. The use of fast and small memory footprint algorithms is paramount when the number of individuals is very large. In many situations such analysis needs to be performed and kept up to date in near real time. Fortunately, approximate answers are usually adequate when dealing with this problem. This paper presents a new and innovative algorithm that addresses this problem by merging the commonly used counter-based and sketch-based techniques for top-k identification. The algorithm provides the top-k list of elements, their frequency and an error estimate for each frequency value. It also provides strong guarantees on the error estimate, order of elements and inclusion of elements in the list depending on their real frequency. Additionally the algorithm provides stochastic bounds on the error and expected error estimates. Telecommunications customer’s behavior and voice call data is used to present concrete results obtained with this algorithm and to illustrate improvements over previously existing algorithms. 2010 Elsevier Inc. All rights reserved.
منابع مشابه
Efficient Computation of Frequent and Top-k Elements in Data Streams
We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general d...
متن کاملFinding top-k elements in a time-sliding window
Identifying the top-k most frequent elements is one of the many problems associated with data streams analysis. It is a well-known and difficult problem, especially if the analysis is to be performed and maintained up to date in near real time. Analyzing data streams in time sliding window model is of particular interest as only the most recent, more relevant events are considered. Approximate ...
متن کاملMining top-k high utility patterns over data streams
Online high utility itemset mining over data streams has been studied recently. However, the existing methods are not designed for producing topk patterns. Since there could be a large number of high utility patterns, finding only top-k patterns is more attractive than producing all the patterns whose utility is above a threshold. A challenge with finding top-k high utility itemsets over data s...
متن کاملMatching Top-k Answers of Twig Patterns in Probabilistic XML
The flexibility of XML data model allows a more natural representation of uncertain data compared with the relational model. The top-k matching of a twig pattern against probabilistic XML data is essential. Some classical twig pattern algorithms can be adjusted to process the probabilistic XML. However, as far as finding answers of the top-k probabilities is concerned, the existing algorithms s...
متن کاملHow to select the largest k elements from evolving data?
In this paper we investigate the top-k-selection problem, i.e. determine the largest, second largest, ..., and the k-th largest elements, in the dynamic data model. In this model the order of elements evolves dynamically over time. In each time step the algorithm can only probe the changes of data by comparing a pair of elements. Previously only two special cases were studied [2]: finding the l...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Inf. Sci.
دوره 180 شماره
صفحات -
تاریخ انتشار 2010